For all problem sets, we will provide you with a single R
Markdown file. Please complete the problem set within a local copy of
this file. To turn in, please upload a fully knitted html version. Make
sure to keep echo=TRUE, as appropriate, to show your
coding.
In Problem Set 6 you were introduced to a stunning dataset analyzing count data for the Angeleno Quail (Callipepla curtii) collected in 2015. As a reminder, three times during the breeding season, surveyors visited 267 survey points doing targeted surveys for the quail.
Last time, you fit a mixed-effects model to the data and found strong effects of elevation, forest cover, an interaction, and wind. You noticed, however, that your dataset included a lot of zeroes, but you didn’t let that disturb you! After this week’s lessons, however, you’ve become a bit more worried, and you wish to try fitting the data in a way that accounts for imperfect detection. Since you have count data, you decide to use an N-mixture model. Like an occupancy model, but for abundance data, the N-mixture model separates the underlying abundance state process (here, the factors that determine quail abundance) from the observation process (here, how imperfect detection reduces the number of quail actually seen), and models the data as a mixture of these two processes.
After talking to the biologists who surveyed the species, they tell you that they believe that forest cover and elevation likely influence quail abundance, while wind is a very strong factor in their ability to detect quail (but as a variable that changes by the day, wind doesn’t influence their true underlying abundance). The biologists are unsure whether canopy cover (which correlates with ground cover, as well as auditory transmission) impacts detection probability, so it’s probably safer to test for it. You use this natural history knowledge to guide you in your model parameterization.
The data are identical to those used for Problem Set 6, so re-load the R data object.
load("ProblemSet6.Rdata")
Your goal is to run a single N-mixture model in JAGS and learn what you can about the species from the fitted model. The model should include all the major components of this week’s problem set: the effect of elevation, forest cover, the interaction, and wind. Note, however, that you do not have to include a random effect for site with this N-mixture model, because N-mixture models, like occupancy models, assume that the true underlying state does not change across repeat visits. Thus the “pseudoreplication” of repeat visits is what informs the detection parameter, but abundance covariates are parameterized on the latent variable, \(N\), which is assumed to be unchanging across visits (an assumption called “closure”). You should realize, consequently, that the true abundance at a site (\(N_j\)) is only estimated once for each of \(j\) sites. Accordingly, your model should follow the general structure of an N-mixture model, which is:
\(y_{jk}\sim binomial(p_{jk}, N_j)\)
\(N_j \sim Poisson(\lambda_j)\)
where \(p_{jk}\) is a probability of detection at site \(j\) and visit \(k\) that can be modeled as a function of covariates, and \(\lambda_j\) is the expected true abundance at site \(j\), which can also be modeled as a function of covariates.
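Before writing the JAGS code, it can help to simulate this two-level process directly in R as a sanity check. A minimal sketch with made-up covariate values and effect sizes (illustrative only, not the problem-set data):

```r
set.seed(123)
n_sites <- 267; n_visits <- 3
elev <- rnorm(n_sites); forest <- rnorm(n_sites)       # standardized site covariates
lam <- exp(0.5 - 2*elev + 2*forest + 1.3*elev*forest)  # expected true abundance (log link)
N <- rpois(n_sites, lam)                               # latent true abundance, one draw per site
wind <- matrix(rnorm(n_sites*n_visits), n_sites, n_visits)
p <- plogis(-1 - 3*wind)                               # detection probability (logit link)
y <- matrix(rbinom(n_sites*n_visits, size = rep(N, n_visits), prob = p),
            n_sites, n_visits)                         # observed counts: binomial thinning of N
```

Note that \(y_{jk}\) can never exceed \(N_j\), which is exactly why JAGS will later need initial values for \(N\) at least as large as the maximum observed count at each site.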
Draw a DAG for the model you are about to create. Display your
DAG below using the ggdag R package.
Using your DAG, write out the JAGS model code for this model. Use
appropriate vague priors for all parameters. You will need to create two
nested for-loops, a first loop over every site j, and a
second loop over each visit k. Thus, your process model
should be for every count[j,k]. Keep in mind that your
deterministic process and observation models will use different link
functions, which means that a truly ‘vague’ prior for your parameters
may differ depending on whether they are an abundance parameter or a
detection parameter.
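One way to see this warning about link functions is to push the usual dnorm(0, 0.00001) prior through the inverse logit: on the probability scale, nearly all of its mass ends up at 0 or 1, so it is not truly non-informative for a detection parameter. A quick base-R check (illustrative only):

```r
set.seed(1)
# Draws from the 'vague' N(0, precision = 0.00001) prior, i.e. sd ~ 316
draws <- rnorm(1e5, mean = 0, sd = sqrt(1/0.00001))
p_prior <- plogis(draws)  # implied prior on a detection probability
# Nearly all prior mass sits at the extremes near 0 or 1
mean(p_prior < 0.01 | p_prior > 0.99)
```

In practice, a somewhat tighter normal prior on the logit scale (sd on the order of 1.5 to 2) is closer to flat on the probability scale.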
qs_mod<-function(){
  # Vague priors for the abundance (log-link) parameters
  b0 ~ dnorm(0,0.00001)
  b_elev ~ dnorm(0,0.00001)
  b_forest ~ dnorm(0,0.00001)
  b_int ~ dnorm(0,0.00001)
  # Vague priors for the detection (logit-link) parameters
  alpha_0 ~ dnorm(0,0.00001)
  alpha_wind ~ dnorm(0,0.00001)
  alpha_forest ~ dnorm(0,0.00001)
  # Site-level random effect: precision prior and derived sd
  tau_site ~ dgamma(0.001,0.001)
  sigma_site <- 1 / sqrt(tau_site)
  for (j in 1:site_num){
    # Latent true abundance at site j (closure: one N per site)
    N[j] ~ dpois(lam[j])
    log(lam[j]) <- b0 + b_elev*elev[j] + b_forest*forest[j] +
      b_int*elev[j]*forest[j] + b_site[j]
    b_site[j] ~ dnorm(0,tau_site)
    for (k in 1:visit_num){
      # Detection probability at site j, visit k
      logit(p[j,k]) <- alpha_0 + alpha_wind*wind[j,k] + alpha_forest*forest[j]
      # Observed count: binomial thinning of the latent abundance
      count[j,k] ~ dbinom(p[j,k],N[j])
      # Replicated data for posterior predictive checks
      count.new[j,k] ~ dbinom(p[j,k],N[j])
    }
  }
}
# I do not think it would be advantageous to use the findings from last week's analysis as informed priors for the current analysis. Last week's model implicitly assumed perfect detection of the individuals in the community, whereas this week's model explicitly accounts for imperfect detection. If we used last week's estimates as priors, those priors might not cover the posterior distributions we obtain this week: the abundance-related parameters (e.g., b0, b_forest) will likely be consistently larger once imperfect detection is accounted for.
# Furthermore, we have moved the effect of wind out of the abundance covariates and into the detection submodel, so the model structure has changed, which would make last week's estimates an inaccurate basis for priors in this week's analysis.
A Node inconsistent with parents error from JAGS indicates
that you need to give JAGS plausible initial values for the
latent variable, N.
set.seed(2025)
count=ps.data[["count"]]
elevation=ps.data[["elev"]]
forest=ps.data[["forest"]]
wind=ps.data[["wind"]]
avg_count=apply(count,1,mean)
site_num=nrow(wind)
visit_num=ncol(wind)
site=rep(1:site_num,each=visit_num)
visit=rep(1:visit_num,times=site_num)
total_sample=site_num*visit_num
jags_data_qs<-list(
site_num=site_num,
visit_num=visit_num,
count=count,
wind=wind,
elev=elevation,
forest=forest
)
qs.fit<-jags(data=jags_data_qs,parameters.to.save=c("b0", "b_elev","b_forest","b_int","sigma_site","alpha_0","alpha_wind","alpha_forest"),model.file=qs_mod,
inits=list(list(N=apply(count,1,max)+1),
list(N=apply(count,1,max)+1),
list(N=apply(count,1,max)+1)),
n.chains=3,
n.iter=20000,
n.burnin=5000,
n.thin=10)
## module glm loaded
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 801
## Unobserved stochastic nodes: 1343
## Total graph size: 7492
##
## Initializing model
qs.check<-jags(data=jags_data_qs,parameters.to.save=c("count.new","p"),model.file=qs_mod,
inits=list(list(N=apply(count,1,max)+1),
list(N=apply(count,1,max)+1),
list(N=apply(count,1,max)+1)),
n.chains=3,
n.iter=20000,
n.burnin=5000,
n.thin=10)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 801
## Unobserved stochastic nodes: 1343
## Total graph size: 7492
##
## Initializing model
MCMCsummary(qs.fit)
## mean sd 2.5% 50% 97.5%
## alpha_0 -0.89359679 0.11327924 -1.12064343 -8.930820e-01 -0.6728587
## alpha_forest 0.01305130 0.16897255 -0.29803828 9.500327e-03 0.3376666
## alpha_wind -2.98527991 0.13919744 -3.24938669 -2.989302e+00 -2.7015073
## b0 0.62427944 0.08908459 0.44919141 6.251070e-01 0.7980839
## b_elev -2.05852386 0.11907291 -2.29740959 -2.058708e+00 -1.8250669
## b_forest 2.14059057 0.12068901 1.90594868 2.143818e+00 2.3740167
## b_int 1.29513484 0.16953891 0.96008017 1.295504e+00 1.6244017
## deviance 1323.42152541 28.44176862 1269.67480612 1.322952e+03 1381.8100583
## sigma_site 0.07152209 0.03542327 0.02278854 6.483815e-02 0.1561256
## Rhat n.eff
## alpha_0 1.01 180
## alpha_forest 1.00 226
## alpha_wind 1.01 238
## b0 1.01 587
## b_elev 1.00 960
## b_forest 1.00 665
## b_int 1.00 1043
## deviance 1.02 216
## sigma_site 1.00 1422
# Most of the Rhat values are 1 or very close to 1, so the model appears to have converged.
# However, some count.new values have Rhat of NaN, which could be problematic (NaN typically arises when a node has zero variance across samples, e.g. a replicated count stuck at 0).
traceplot(qs.fit)
pr_norm=rnorm(15000,mean=0,sd=sqrt(1/0.00001))
pr_gamma=1/sqrt(rgamma(15000,0.001,0.001)) # match the dgamma(0.001,0.001) prior on tau_site
MCMCtrace(qs.fit,params="b0",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b0", priors = pr_norm, pdf = FALSE):
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.
MCMCtrace(qs.fit,params="b_elev",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b_elev", priors = pr_norm, pdf = FALSE):
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.
MCMCtrace(qs.fit,params="b_int",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b_int", priors = pr_norm, pdf = FALSE):
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.
MCMCtrace(qs.fit,params="b_forest",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b_forest", priors = pr_norm, pdf =
## FALSE): Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.
MCMCtrace(qs.fit,params="alpha_0",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "alpha_0", priors = pr_norm, pdf =
## FALSE): Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.
MCMCtrace(qs.fit,params="alpha_wind",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "alpha_wind", priors = pr_norm, pdf =
## FALSE): Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.
MCMCtrace(qs.fit,params="alpha_forest",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "alpha_forest", priors = pr_norm, :
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.
#MCMC is much more efficient for the abundance parameters than for the detection parameters. I think one main reason is that the detection parameters are highly correlated with N, and hence with the values of the abundance parameters at each MCMC step. As a result, the detection parameters show stronger dependence between steps: beyond the autocorrelation inherent to their own MCMC updates, they also inherit autocorrelation from the updates of the abundance parameters. The sampler is therefore less efficient for the detection process.
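The intuition that extra autocorrelation shrinks the effective sample size can be illustrated with two toy chains and a crude base-R ESS estimate (illustrative only; this is not the exact calculation MCMCsummary performs):

```r
set.seed(42)
n <- 5000
chain_fast <- arima.sim(list(ar = 0.1), n = n)  # weakly autocorrelated, 'abundance-like' chain
chain_slow <- arima.sim(list(ar = 0.9), n = n)  # strongly autocorrelated, 'detection-like' chain
crude_ess <- function(x, lag.max = 100) {
  rho <- acf(x, plot = FALSE, lag.max = lag.max)$acf[-1]  # drop lag-0
  length(x) / (1 + 2 * sum(rho[rho > 0]))  # rough initial-positive-sequence estimate
}
crude_ess(chain_fast)  # close to n
crude_ess(chain_slow)  # far smaller: same chain length, much less information
```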
MCMCsummary(qs.fit)
## mean sd 2.5% 50% 97.5%
## alpha_0 -0.89359679 0.11327924 -1.12064343 -8.930820e-01 -0.6728587
## alpha_forest 0.01305130 0.16897255 -0.29803828 9.500327e-03 0.3376666
## alpha_wind -2.98527991 0.13919744 -3.24938669 -2.989302e+00 -2.7015073
## b0 0.62427944 0.08908459 0.44919141 6.251070e-01 0.7980839
## b_elev -2.05852386 0.11907291 -2.29740959 -2.058708e+00 -1.8250669
## b_forest 2.14059057 0.12068901 1.90594868 2.143818e+00 2.3740167
## b_int 1.29513484 0.16953891 0.96008017 1.295504e+00 1.6244017
## deviance 1323.42152541 28.44176862 1269.67480612 1.322952e+03 1381.8100583
## sigma_site 0.07152209 0.03542327 0.02278854 6.483815e-02 0.1561256
## Rhat n.eff
## alpha_0 1.01 180
## alpha_forest 1.00 226
## alpha_wind 1.01 238
## b0 1.01 587
## b_elev 1.00 960
## b_forest 1.00 665
## b_int 1.00 1043
## deviance 1.02 216
## sigma_site 1.00 1422
#The mean of b0 is 0.624, meaning that the expected abundance at an average site, ignoring all other effects, is e^0.624.
#Elevation has a strong, highly certain negative effect on quail abundance. The mean of b_elev is -2.059, so a one-unit increase in elevation changes the expected abundance by a factor of e^-2.059.
#Both forest cover and the interaction between forest cover and elevation have positive effects on quail abundance, and we are very confident in these effects. A one-unit increase in forest cover, or in the elevation-by-forest interaction term, multiplies the expected abundance by e^2.14 and e^1.30, respectively.
#sigma_site is positive but relatively low (0.0715), suggesting that most of the variation is explained by the existing covariates and that sites do not harbor strong hidden sources of variation in quail abundance.
#alpha_0 is strongly negative with high certainty. Its mean is -0.894, so the baseline odds (not probability) of detection, ignoring other effects, are e^-0.894.
#alpha_wind has a very strong, highly certain negative effect on detection. Its mean is -2.99, so a one-unit increase in wind multiplies the odds of detection by e^-2.99. We are therefore highly confident that wind reduces our ability to detect quail.
#alpha_forest has a 95% credible interval that overlaps 0, so we are uncertain whether it has any real effect on detection. Moreover, its mean is small (0.0131). It is therefore safe to say that canopy cover does not contribute much to the detection process for quail.
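Since the detection submodel is on the logit scale and the abundance submodel on the log scale, the posterior means above can be back-transformed for interpretation. A quick sketch using the point estimates only (ignoring posterior uncertainty):

```r
# Baseline detection probability at average wind and forest cover (inverse logit)
plogis(-0.894)            # about 0.29
# Detection probability at one unit of wind above average
plogis(-0.894 - 2.985)    # about 0.02
# Baseline expected true abundance at an average site (inverse log link)
exp(0.624)                # about 1.87 quail
```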
Last week's posterior predictive checks used the test statistics
mean(y) and sd(y). Remember
that the model fit well for the mean but not the sd. Repeat this now,
with our N-mixture model, by calculating the observed and posterior
distributions for both test statistics, and plot each (make sure your
plots show both observed lines and the full posterior). Calculate the
Bayesian p-value for each as well.
ppd=MCMCchains(qs.check,params="count.new")
ppd_mean=apply(ppd, 1, mean)
ppd_sd=apply(ppd, 1, sd)
obs_mean=mean(count)
obs_sd=sd(count)
df=data.frame(mean=ppd_mean,sd=ppd_sd)
hist(ppd_mean,breaks=20,main="Posterior Predictive Check for Mean",xlab="Simulated Means",ylab="Frequency",xlim=c(min(ppd_mean,obs_mean)-1,max(ppd_mean,obs_mean)+1))
abline(v=obs_mean,col="red",lwd=2,lty=2)
legend("topright",legend ="Observed mean",col="red",lty=2,lwd=2)
hist(ppd_sd,breaks=20,main="Posterior Predictive Check for SD",xlab="Simulated SDs",ylab="Frequency",xlim=c(min(ppd_sd,obs_sd)-1,max(ppd_sd,obs_sd)+1))
abline(v=obs_sd,col="red",lwd=2,lty=2)
legend("topright",legend ="Observed Standard Deviation",col="red",lty=2,lwd=2)
p_mean=mean(ppd_mean>obs_mean)
p_sd=mean(ppd_sd>obs_sd)
p_mean
## [1] 0.4791111
p_sd
## [1] 0.6746667
#The Bayesian p-values are 0.48 for the mean and 0.67 for the sd, both reasonably close to 0.5.
#This is reasonable because we are now including detection probability in the model: the estimated true abundance can exceed the observed counts, while the replicated observed counts still match the data well on average.
#For the standard deviation, the fit has also improved relative to last week. The visit-to-visit noise that previously inflated the observed sd relative to the predictions is now absorbed by the detection process (and the site random effect), so the replicated datasets show variation comparable to the observed dataset.
ppd=MCMCchains(qs.check,params="count.new")
b0=MCMCchains(qs.fit,params="b0")
b_elev=MCMCchains(qs.fit,params="b_elev")
b_forest=MCMCchains(qs.fit,params="b_forest")
b_int=MCMCchains(qs.fit,params="b_int")
forest_cov_pred=seq(min(forest),max(forest),by=0.01)
elev_discrete=c(-1,0,1)
plot_list <- list()
for (i in elev_discrete){
median_ppd=c()
lower_ppd=c()
upper_ppd=c()
elev_val_used=c()
prediction=c()
for (j in 1:length(forest_cov_pred)){
forest_cov_val=forest_cov_pred[j]
lam_pred=exp(b0+b_elev*i+b_forest*forest_cov_val+b_int*i*forest_cov_val) # renamed so we do not overwrite the count data matrix
lower_ppd[j]=quantile(lam_pred,0.055)
upper_ppd[j]=quantile(lam_pred,0.945)
elev_val_used[j]=i
prediction[j]=median(lam_pred)
}
plot_list[[as.character(i)]] <- data.frame(
forest_cover=forest_cov_pred,
median_pred=prediction,
lower=lower_ppd,
upper=upper_ppd,
elevation=elev_val_used)
}
ggplot(plot_list[["-1"]],aes(x=forest_cover))+
geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
labs(title = "Predicted Quail Abundance with 89% Credible Interval (Elevation=-1)",x="Forest Cover",y="Predicted Abundance",col="Key")
ggplot(plot_list[["0"]],aes(x=forest_cover))+
geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
labs(title = "Predicted Quail Abundance with 89% Credible Interval (Elevation=0)",x="Forest Cover",y="Predicted Abundance",col="Key")
ggplot(plot_list[["1"]],aes(x=forest_cover))+
geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
labs(title = "Predicted Quail Abundance with 89% Credible Interval (Elevation=1)",x ="Forest Cover",y = "Predicted Abundance",col="Key")
#The plots look similar in shape to those from Problem Set 6. This is reasonable because both models describe the underlying biology (how elevation and forest cover determine abundance) in the same way, so the predicted relationships should be similar.
#One thing to notice, however, is that the plots above predict much higher counts per site than those produced in PS6. This is because we have now accounted for the detection process: the true number of quail per site must be larger than the number actually observed under imperfect detection. Thus the predicted abundances here are much higher than in PS6.
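This gap can be checked numerically: a detection-naive model like last week's effectively estimates the thinned quantity lambda * p rather than the true abundance. A rough sketch using the posterior means at average covariate values (point estimates only, an illustrative calculation):

```r
lambda_hat <- exp(0.624)    # expected true abundance at an average site
p_hat <- plogis(-0.894)     # per-visit detection probability at average conditions
lambda_hat * p_hat          # expected *observed* count, about 0.54
```

So the true abundance estimated this week is roughly 1/p times larger than the counts a detection-naive model is fit to, consistent with the higher curves above.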
a0=MCMCchains(qs.fit,params="alpha_0")
a_forest=MCMCchains(qs.fit,params="alpha_forest")
a_wind=MCMCchains(qs.fit,params="alpha_wind")
p_val_ppd=MCMCchains(qs.check,params="p")
wind_cov_pred=seq(min(wind),max(wind),by=0.01)
forest_discrete=c(-1,0,1)
plot_list <- list()
for (i in forest_discrete){
median_ppd=c()
lower_ppd=c()
upper_ppd=c()
forest_val_used=c()
prediction=c()
for (j in 1:length(wind_cov_pred)){
wind_val=wind_cov_pred[j]
p_val=exp(a0+a_forest*i+a_wind*wind_val)/(1+exp(a0+a_forest*i+a_wind*wind_val))
lower_ppd[j]=quantile(p_val,0.055)
upper_ppd[j]=quantile(p_val,0.945)
forest_val_used[j]=i
prediction[j]=median(p_val)
}
plot_list[[as.character(i)]] <- data.frame(
wind_val_plot=wind_cov_pred,
median_pred=prediction,
lower=lower_ppd,
upper=upper_ppd,
forest_val_plot=forest_val_used)
}
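A small aside on the inverse logit computed by hand in the loop above: base R's plogis() gives the same result as exp(x)/(1+exp(x)) and is numerically safer for large |x|:

```r
x <- c(-5, 0, 5)
manual <- exp(x) / (1 + exp(x))  # hand-rolled inverse logit
stopifnot(isTRUE(all.equal(plogis(x), manual)))  # identical within tolerance
```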
ggplot(plot_list[["-1"]],aes(x=wind_val_plot))+
geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
labs(title = "Predicted Detection Probability with 89% Credible Interval (Forest Cover=-1)",x="Wind",y="Probability of observation",col="Key")
ggplot(plot_list[["0"]],aes(x=wind_val_plot))+
geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
labs(title = "Predicted Detection Probability with 89% Credible Interval (Forest Cover=0)",x="Wind",y="Probability of observation",col="Key")
ggplot(plot_list[["1"]],aes(x=wind_val_plot))+
geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
labs(title = "Predicted Detection Probability with 89% Credible Interval (Forest Cover=1)",x="Wind",y="Probability of observation",col="Key")